This project aims to create a comment classifier capable of assigning specific toxicity labels to text comments, such as insults, identity hate, etc.
Throughout the project, various experiments were conducted with pre-trained NLP models, including RoBERTa and DistilBERT. The best model achieved an AUC-PR of 0.693 and a macro-average F1-score of 0.67.
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge -p data
Unzip and store files:
zip_file_path = "data/jigsaw-toxic-comment-classification-challenge.zip"

# Extract the competition archive, then the nested per-file archives
with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall("data")
os.remove(zip_file_path)

for inner_zip in os.listdir("data"):
    inner_path = f"data/{inner_zip}"
    with zipfile.ZipFile(inner_path, "r") as zip_ref:
        zip_ref.extractall("data")
    os.remove(inner_path)
Read the data:
Show the code
train_data = pl.read_csv("data/train.csv")
2 Exploratory Data Analysis
Number of samples in the training data:
Show the code
len(train_data)
159571
Overview:
Show the code
train_data.head()
shape: (5, 8)

| id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| str | str | i64 | i64 | i64 | i64 | i64 | i64 |
| "0000997932d777… | "Explanation Wh… | 0 | 0 | 0 | 0 | 0 | 0 |
| "000103f0d9cfb6… | "D'aww! He matc… | 0 | 0 | 0 | 0 | 0 | 0 |
| "000113f07ec002… | "Hey man, I'm r… | 0 | 0 | 0 | 0 | 0 | 0 |
| "0001b41b1c6bb3… | "" More I can't… | 0 | 0 | 0 | 0 | 0 | 0 |
| "0001d958c54c6e… | "You, sir, are … | 0 | 0 | 0 | 0 | 0 | 0 |
Show the code
toxicity_labels = train_data.columns[-6:]
Are there duplicate IDs?
Show the code
train_data["id"].is_duplicated().any()
False
Are there duplicate comments?
Show the code
train_data["comment_text"].is_duplicated().any()
False
2.1 Class Balance
Percentage of comments with each label:
Show the code
fig_classes, ax_classes = plt.subplots(figsize=BASE_FIG_SIZE)
class_proportion = train_data[:, 2:].sum().to_numpy()[0] / len(train_data)
sns.barplot(x=class_proportion * 100, y=train_data.columns[2:], ax=ax_classes)
ax_classes.xaxis.set_major_formatter(ticker.PercentFormatter())
ax_classes.set_xlabel("Percentage of Comments with Label")
The toxicity labels are rare: the least common labels are present in less than 1% of the comments.
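One common way to handle such imbalance (the "balanced" class-weight option used later in the modeling configuration) is to weigh each label by its rarity. As a minimal sketch, per-label positive weights can be computed as the ratio of negative to positive samples; the label counts below are hypothetical stand-ins, not the dataset's exact values:

```python
import numpy as np

# Hypothetical per-label positive counts for a dataset of 159,571 comments
# (order: toxic, severe_toxic, obscene, threat, insult, identity_hate)
n_samples = 159_571
label_counts = np.array([15_294, 1_595, 8_449, 478, 7_877, 1_405])

# pos_weight = negatives / positives: rare labels get larger weights,
# so their errors contribute more to a weighted loss
pos_weight = (n_samples - label_counts) / label_counts
print(pos_weight.round(1))
```

The rarest label (threat in this sketch) receives the largest weight, counteracting its near-absence from the data.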
Can comments have more than one label?
Show the code
label_sums = train_data[:, 2:].sum_horizontal().value_counts()
label_sums.columns = ["Total labels", "Number of Comments"]
label_sums.sort("Number of Comments")
shape: (7, 2)

| Total labels | Number of Comments |
|---|---|
| i64 | u32 |
| 6 | 31 |
| 5 | 385 |
| 4 | 1760 |
| 2 | 3480 |
| 3 | 4209 |
| 1 | 6360 |
| 0 | 143339 |
Comments can have more than one label.
Fraction of comments with no labels:
Show the code
round(
    (
        label_sums.filter(pl.col("Total labels") == 0)["Number of Comments"]
        / len(train_data)
    ).item(),
    2,
)
2.3 Language Detection
Using a language detection model to predict the language in each comment:
Show the code
from langdetect import LangDetectException


def detect_language(text):
    # langdetect raises LangDetectException on empty or non-text input
    try:
        result = detect_langs(text)[0].lang
    except LangDetectException:
        result = "empty"
    return result


languages = train_data["comment_text"].map_elements(detect_language)
joblib.dump(languages, "temp/languages.joblib")
Upon further inspection of some of these comments, it is evident that most of them are in fact in English and were flagged as another language due to the model's imperfections. Therefore, no comments will be omitted.
3 Modeling
Class interaction schema:
The data preprocessing and model training process are established using several custom classes that inherit from PyTorch classes. The setup is illustrated in the schema above. The values of parameters highlighted in red are adjustable through a configuration file, which is employed to instantiate a dataloader, model, and trainer.
The parameters in the configuration file are as follows:

- data: polars DataFrame or string directory
- batch_size: Batch size
- model: Name of the model passed to the AutoModel and AutoTokenizer .from_pretrained() functions
- tokenizer_max_len: Maximum length to which tokenized text is padded or truncated
- class_weights: balanced or None. Balanced weighs each label differently based on its frequency, assigning higher weights to rare labels.
- learning_r: Initial learning rate
- stop_patience: Number of epochs to continue after the stopping delta limit is reached
- stop_delta: Smallest difference between the current and last epoch's validation loss, below which training is stopped
- unfreeze_delta: Smallest difference between the current and last epoch's validation loss, below which all model layers are unfrozen
- tuning_lr: Learning rate of the model's backbone
- max_epochs: Maximum number of epochs if training is not stopped earlier
- dropout: If not None, a float fraction corresponding to the dropout rate of a dropout layer placed before the final classification layer
- under_sample: Use only this fraction of benign comments in the training data
- train_frac: Use only this fraction of the training data
- val_frac: Use only this fraction of the validation data
- test_frac: Use only this fraction of the test data
- name: Experiment name
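As an illustrative sketch of how these parameters fit together (the file format and every value below are assumptions; the document only specifies the parameter names), one experiment's configuration might look like:

```python
# Hypothetical configuration for a single experiment; keys follow the
# parameter list above, values are illustrative only
config = {
    "data": "data/train.csv",
    "batch_size": 16,
    "model": "distilbert-base-uncased",
    "tokenizer_max_len": 166,
    "class_weights": "balanced",
    "learning_r": 1e-4,
    "stop_patience": 2,
    "stop_delta": 0.001,
    "unfreeze_delta": 0.01,
    "tuning_lr": 1e-5,
    "max_epochs": 20,
    "dropout": 0.25,
    "under_sample": 1.0,
    "train_frac": 1.0,
    "val_frac": 1.0,
    "test_frac": 1.0,
    "name": "distilbert_baseline",
}
```

Keeping every tunable value in one dictionary like this makes it straightforward to instantiate the dataloader, model, and trainer from a single source of truth.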
Load tensorboard:
Show the code
%load_ext tensorboard
3.1 Experiment 1: DistilBERT
The first experiment uses DistilBERT as a base model.
3.3 Experiment 3: DistilBERT with a lower learning rate and batch size
Using a more complex model did not improve the results and increased the training time. Therefore, the next experiment again uses DistilBERT, but this time with a lower learning rate and a smaller batch size.
3.4 Experiment 4: DistilBERT with an additional dropout layer
Lowering the learning rate and the batch size slightly improved performance. Next, an extra dropout layer with a rate of 0.25 is added before the final classification layer to better regularize the model.
Adding the extra dropout layer did not improve the model's performance, but it did speed up training, causing the model to converge in fewer epochs.
It is evident that the model strongly favours recall over a large range of decision thresholds. Consequently, to optimize the F1-score (increasing precision with minimal loss of recall), the thresholds for each class had to be increased.
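A minimal sketch of such a per-class threshold search (pure NumPy, with synthetic probabilities and labels standing in for the model's validation outputs) could look like this:

```python
import numpy as np


def best_f1_threshold(y_true, y_prob, candidates=np.linspace(0.05, 0.95, 19)):
    """Return the candidate threshold that maximizes F1 for one label."""
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        y_pred = (y_prob >= t).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Synthetic example: a rare label with well-separated probabilities
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.1).astype(int)
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.1, 1000), 0, 1)
threshold, f1 = best_f1_threshold(y_true, y_prob)
```

In the project this search would be run once per label on the validation set, yielding six class-specific thresholds instead of a single global 0.5.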
The explainer highlights the most important words in each comment for a specific label. Yet it is far more interesting to see the interpretations in cases where mistakes were made.
Misclassification of toxic comments seems to be less of a problem, as the predicted toxicity probabilities were still high; thus, a high recall can be achieved by lowering the thresholds if needed.
5.2 Kaggle Score
The following cells were used to make a late submission to Kaggle, which achieved a score of 0.979.
Loading Test Data:
Show the code
test_data = pl.read_csv("data/test.csv")

# Add six placeholder label columns so the dataset class can be reused
for i in range(6):
    test_data = test_data.with_columns(pl.zeros(len(test_data)).alias(str(i)))
Making predictions:
Show the code
test_dataset = CommentDataModule.CommentDataset(
    test_data, test_loader_distilbert_02.dataset.tokenizer, 166
)
testset_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8, num_workers=0)
model_distilbert_02.eval()
test_predicts = trainer_distilbert_02.predict(model_distilbert_02, testset_loader)
test_predicts = torch.cat([batch.sigmoid() for batch in test_predicts], dim=0)
clear_output()
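The predictions then need to be written in Kaggle's submission format: an id column followed by one probability column per label. A self-contained sketch (the ids and probabilities below are stand-ins for the real test ids and sigmoid outputs):

```python
import csv

import numpy as np

label_names = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Stand-ins for the real test ids and the model's sigmoid outputs
ids = ["id_000", "id_001"]
probs = np.array(
    [
        [0.90, 0.10, 0.70, 0.02, 0.60, 0.05],
        [0.01, 0.00, 0.01, 0.00, 0.02, 0.00],
    ]
)

# Write one row per comment: id followed by six label probabilities
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id"] + label_names)
    for row_id, row_probs in zip(ids, probs):
        writer.writerow([row_id] + [float(p) for p in row_probs])
```

The real submission would substitute the test ids from test.csv and the concatenated `test_predicts` tensor for the placeholder arrays.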
2.2 Comment Length
Comment Length Distribution:
Show the code
Most of the comments have fewer than 1,000 characters, and lengths are capped at 5,000 characters.
Filtering out comments with no letters:
Show the code
Number of tokens:
Show the code
To see the number of tokens in each comment, the comments were tokenized with the BERT uncased tokenizer. Toxic comments appear to have a lower median length than benign comments.
90th percentile of tokenized comment length:
Show the code
To strike a good compromise between the model's comprehension of long-range token relationships and processing time, the max_length parameter of the tokenizers will be set to the 90th percentile of the toxic comment lengths.